Project Description

AllLife Bank is a US bank with a growing customer base. The majority of these customers are liability customers (depositors) with deposits of varying sizes. The number of customers who are also borrowers (asset customers) is quite small, and the bank wants to expand this base rapidly to bring in more loan business and, in the process, earn more through interest on loans. In particular, management wants to explore ways of converting its liability customers into personal loan customers (while retaining them as depositors). A campaign the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better targeting to increase the success ratio. As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.

Objective

1. To predict whether a liability customer will buy a personal loan or not.
2. To determine which variables are most significant.
3. To identify which segment of customers should be targeted more.

Data Dictionary

Best Practices for Notebook:

• The notebook should be well-documented, with inline comments explaining the functionality of code and markdown cells containing comments on the observations and insights.
• The notebook should be run from start to finish in a sequential manner before submission.
• It is preferable to remove all warnings and errors before submission.
• The notebook should be submitted as an HTML file (.html) and as a notebook file (.ipynb).

Submission Guidelines:

1. There are two parts to the submission: 
    1. A well commented Jupyter notebook [format - .ipynb]
    2. File converted to HTML format 
2. Any assignment found copied/plagiarized from other groups will not be graded and will be awarded zero marks
3. Please ensure timely submission, as any submission past the deadline will not be accepted for evaluation
4. Submission will not be evaluated if,
    1. it is submitted post-deadline, or,
    2. more than 2 files are submitted

Insights:

We will drop the entries with no ZIP_City/ZIP_State, as these columns are difficult to impute.
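A minimal sketch of this drop, assuming a pandas DataFrame with ZIP_City and ZIP_State columns (the toy frame below is illustrative, not the real data):

```python
import pandas as pd

# Toy stand-in for the loan data; column names assumed from the text above
df = pd.DataFrame({
    "ZIP_City": ["San Jose", None, "Palo Alto"],
    "ZIP_State": ["CA", "CA", None],
    "Income": [120, 45, 80],
})

# Drop any row missing either ZIP_City or ZIP_State rather than trying to impute them
df_clean = df.dropna(subset=["ZIP_City", "ZIP_State"]).reset_index(drop=True)
```

`dropna(subset=...)` removes a row if any of the listed columns is missing, which matches the "drop rather than impute" decision above.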

Observations

Univariate analysis - Bivariate analysis

Key meaningful observations on the relationship between variables

Observations on Age

Observation on Income

Observation on Education

Observation on Mortgage

Observation on Personal_Loan

- 90%+ of the customers did not accept a personal loan

Observation on Securities_Account

- 89%+ of the customers do not hold a securities account

Observation on CD_Account

- 93%+ of the customers do not hold a CD account

Observation on Online

- 59%+ of the customers use the Internet banking facility

Bivariate Analysis

Observation on Family versus Personal_Loan
Observation on Largest Mean ZIP_City versus Mortgage
Observation on BuyPersonal_Loan versus Income/Income_log

Data Preparation

Data Preprocessing

Prepare the data for analysis:
- Missing value treatment
- Outlier detection (treat, if needed)
- Feature engineering
- Prepare data for modelling and check the split
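The modelling split can be sketched as follows; the toy frame and its columns (Income, Family, Personal_Loan) are assumptions standing in for the prepared dataset:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Toy stand-in for the prepared feature matrix; Personal_Loan is the target
df = pd.DataFrame({
    "Income": rng.normal(75, 30, 200).round(1),
    "Family": rng.integers(1, 5, 200),
    "Personal_Loan": rng.integers(0, 2, 200),
})

X = df.drop(columns="Personal_Loan")
y = df["Personal_Loan"]

# Stratify on the target so the minority class keeps the same share in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
```

Checking `y_train.value_counts(normalize=True)` against `y_test.value_counts(normalize=True)` is one way to verify the split preserved the class balance.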

Model Building - Logistic Regression

Logistic Regression Model performance evaluation and improvement

- Comment on which metric is right for model performance evaluation and why.
- Can model performance be improved? If yes, do so using appropriate techniques for logistic regression and comment on model performance after improvement.

Logistic Regression

Finding the coefficients

Converting coefficients to odds

Odds from coefficients
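A logistic regression coefficient β converts to an odds ratio via exp(β). A sketch with made-up coefficient values (not the fitted model's actual estimates):

```python
import numpy as np

# Hypothetical coefficients from a fitted logistic regression
coefs = {"Income": 0.05, "Family": 0.7, "CD_Account": 3.2}

# exp(beta) gives the multiplicative change in the odds of taking a loan
# for a one-unit increase in that predictor, holding the others fixed
odds = {name: np.exp(b) for name, b in coefs.items()}
```

For example, exp(0.7) ≈ 2.01 would mean each additional family member roughly doubles the odds of taking the loan, under these illustrative numbers.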

Checking model performance on training set

Checking performance on test set

ROC-AUC on training set

ROC-AUC on test set
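One way to compute ROC-AUC on both splits with scikit-learn; synthetic data stands in for the bank's features here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the prepared bank features
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# ROC-AUC is computed from predicted probabilities, not hard labels
auc_train = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
auc_test = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```

A large gap between the training and test AUC would indicate overfitting.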

Checking model performance on training set

Precision-Recall curve and see if we can find a better threshold
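A sketch of threshold selection from the Precision-Recall curve, again on synthetic data; the "precision ≈ recall" rule used below is one common heuristic, not necessarily the notebook's choice:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

probs = model.predict_proba(X_tr)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_tr, probs)

# Pick the threshold where precision and recall are closest;
# thresholds has one fewer element than precision/recall
best_idx = np.argmin(np.abs(precision[:-1] - recall[:-1]))
best_threshold = thresholds[best_idx]
```

Predictions at the new cutoff are then `(probs >= best_threshold).astype(int)` instead of the default 0.5 rule.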

Conclusion

Model Building - Decision Tree

Decision Tree Model performance evaluation and improvement

- Try pruning technique(s)
- Evaluate the model on an appropriate metric
- Comment on model performance

The model can make two kinds of wrong predictions:

  1. Predicting that a customer will buy a personal loan when in reality the customer will not. - Waste of campaign funds

  2. Predicting that a customer will NOT buy a personal loan when in reality the customer would. - Loss of opportunity for the campaign

How can we reduce this loss, i.e., how do we reduce False Negatives?
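One common lever for cutting false negatives in a decision tree is `class_weight="balanced"`, which up-weights the minority (loan-taker) class during training. A sketch on synthetic, imbalanced data mimicking the roughly 90/10 loan split:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic data: ~90% non-takers, ~10% loan takers
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Baseline tree vs. a tree that penalises minority-class errors more heavily
plain = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
weighted = DecisionTreeClassifier(class_weight="balanced", random_state=0).fit(X_tr, y_tr)

# Recall on the positive class directly measures how many true
# loan takers we catch, i.e. how few false negatives we make
recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
```

Lowering the classification threshold (as in the Precision-Recall section above) is another way to trade precision for recall.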

Helper functions for the recall score and the confusion matrix
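One possible shape for such a helper, built on scikit-learn's `recall_score` and `confusion_matrix` (the function name is illustrative):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

def model_performance(y_true, y_pred):
    """Return the recall and the 2x2 confusion matrix [[TN, FP], [FN, TP]]."""
    return recall_score(y_true, y_pred), confusion_matrix(y_true, y_pred)

# Tiny worked example: one loan taker is missed (a false negative)
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])
rec, cm = model_performance(y_true, y_pred)
```

Here recall is 2/3: of the three actual loan takers, the model catches two.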

Checking model performance on training set

Checking model performance on test set

Visualizing the Decision Tree

Cost Complexity Pruning

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.
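The pruning loop described above can be sketched as follows, on synthetic data; `cost_complexity_pruning_path` and `ccp_alpha` are real scikit-learn APIs:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Compute the sequence of effective alphas for this training data
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
ccp_alphas = path.ccp_alphas

# Fit one tree per alpha; a larger alpha prunes more aggressively
clfs = [
    DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X, y)
    for a in ccp_alphas
]

# The largest alpha prunes everything: clfs[-1] collapses to a single node
```

In practice the last (trivial) tree is dropped and the remaining alphas are evaluated on held-out data to pick the best-generalising tree.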

Checking performance on training set

Actionable Insights & Recommendations